 cross-modal retrieval


Interactive Cross-modal Learning for Text-3DScene Retrieval

Neural Information Processing Systems

Text-3DScene Retrieval (T3SR) aims to retrieve relevant scenes using linguistic queries. Although traditional T3SR methods have made significant progress in capturing fine-grained associations, they implicitly assume that query descriptions are information-complete. In practical deployments, however, limited by the capabilities of users and models, it is difficult or even impossible to directly obtain a perfect textual query suiting the entire scene and model, thereby leading to performance degradation. To address this issue, we propose a novel Interactive Text-3D Scene Retrieval Method (IDeal), which promotes the enhancement of the alignment between texts and 3D scenes through continuous interaction. To achieve this, we present an Interactive Retrieval Refinement framework (IRR), which employs a questioner to pose contextually relevant questions to an answerer in successive rounds that either promote detailed probing or encourage exploratory divergence within scenes. Upon the iterative responses received from the answerer, IRR adopts a retriever to perform both feature-level and semantic-level information fusion, facilitating scene-level interaction and understanding for more precise re-rankings. To bridge the domain gap between queries and interactive texts, we propose an Interaction Adaptation Tuning strategy (IAT).
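The interaction loop described in this abstract can be illustrated with a toy re-ranking sketch. This is not the IRR framework itself: the questioner and answerer are replaced by hard-coded answer embeddings, the feature-level fusion is assumed to be a plain weighted average, and all names (`fuse`, `rerank`, the scene vectors) are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fuse(query_vec: Sequence[float], answer_vec: Sequence[float],
         weight: float = 0.5) -> List[float]:
    """Assumed fusion rule: weighted average of query and answer features."""
    return [(1 - weight) * q + weight * a for q, a in zip(query_vec, answer_vec)]


def rerank(query_vec: Sequence[float],
           scenes: Dict[str, Sequence[float]]) -> List[str]:
    """Sort scene names by descending similarity to the query vector."""
    return sorted(scenes, key=lambda name: cosine(query_vec, scenes[name]),
                  reverse=True)


# Toy scene embeddings (made up for illustration).
scenes = {
    "kitchen": [1.0, 0.0, 0.2],
    "office": [0.7, 0.7, 0.0],
    "bedroom": [0.0, 1.0, 0.1],
}

# An incomplete initial query, followed by two rounds of answer embeddings.
query = [1.0, 0.3, 0.0]
answers = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.2]]

initial = rerank(query, scenes)
for answer in answers:
    query = fuse(query, answer)  # each round folds new information into the query
refined = rerank(query, scenes)

print(initial)  # ranking before interaction
print(refined)  # ranking after two interaction rounds
```

In this toy setup the extra rounds move the query toward the scene that the answers describe, which is the effect the abstract attributes to iterative interaction.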


Dynamic Masking and Auxiliary Hash Learning for Enhanced Cross-Modal Retrieval

Neural Information Processing Systems

The demand for multimodal data processing drives the development of information technology. Cross-modal hash retrieval has attracted much attention because it can overcome modal differences and achieve efficient retrieval, and it has shown great application potential in many practical scenarios. Existing cross-modal hashing methods struggle to fully capture the semantic information of data from different modalities, which leads to a significant semantic gap between modalities. Moreover, these methods often ignore differences in channel importance, and because they are limited to a single objective, the matching between hash codes also suffers to some extent. To address these issues, we propose a Dynamic Masking and Auxiliary Hash Learning (AHLR) method for enhanced cross-modal retrieval.
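For readers unfamiliar with hash-based retrieval, the following minimal sketch shows the generic retrieval step only: real-valued features are binarized by sign and database items are ranked by Hamming distance to the query code. It does not implement AHLR; the feature vectors and the function names are assumptions made for illustration.

```python
from typing import List, Sequence


def binarize(features: Sequence[float]) -> int:
    """Pack the signs of a real-valued vector into an integer bit code."""
    code = 0
    for value in features:
        code = (code << 1) | (1 if value > 0 else 0)
    return code


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two codes differ."""
    return bin(a ^ b).count("1")


def rank_by_hamming(query_code: int, db_codes: List[int]) -> List[int]:
    """Return database indices ordered from nearest to farthest code."""
    return sorted(range(len(db_codes)),
                  key=lambda i: hamming(query_code, db_codes[i]))


# Toy features: one text query and three image items (values are made up).
text_query = [0.9, -0.2, 0.4, -0.7]
image_db = [
    [-0.5, 0.3, -0.1, 0.8],  # every sign differs from the query
    [0.7, -0.1, 0.2, -0.3],  # same sign pattern as the query
    [0.6, 0.4, 0.1, -0.9],   # one sign differs from the query
]

query_code = binarize(text_query)
db_codes = [binarize(vec) for vec in image_db]
ranking = rank_by_hamming(query_code, db_codes)
print(ranking)  # → [1, 2, 0]
```

Because the comparison reduces to XOR and a bit count, the codes are cheap to store and fast to search, which is the efficiency argument made in the abstract.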


Easy Regional Contrastive Learning of Expressive Fashion Representations

Neural Information Processing Systems

When learning vision-language models (VLM) for the fashion domain, most existing works design new architectures from vanilla BERT with additional objectives, or perform dense multi-task learning with fashion-specific tasks. Though progress has been made, their architecture or objectives are often intricate and the extendibility is limited. By contrast, with simple architecture (comprising only two unimodal encoders) and just the contrastive objective, popular pre-trained VL models (e.g., CLIP) achieve superior performance in general domains, which are further easily extended to downstream tasks. However, inheriting such benefits of CLIP in the fashion domain is non-trivial in the presence of the notable domain gap. Empirically, we find that directly finetuning on fashion data leads CLIP to frequently ignore minor yet important details such as logos and composition, which are critical in fashion tasks such as retrieval and captioning. In this work, to maintain CLIP's simple architecture and objective while explicitly attending to fashion details, we propose $E^2$: Easy Regional Contrastive Learning of Expressive Fashion Representations. $E^2$


An End-To-End Graph Attention Network Hashing for Cross-Modal Retrieval

Neural Information Processing Systems

Due to its low storage cost and fast search speed, cross-modal retrieval based on hashing has attracted widespread attention and is widely used in real-world applications of social media search. However, most existing hashing methods are limited by incomplete feature representations and semantic associations, which greatly restricts their performance and applicability in practice. To deal with this challenge, in this paper, we propose an end-to-end graph attention network hashing (EGATH) for cross-modal retrieval, which can not only capture direct semantic associations between images and texts but also match semantic content between different modalities. We adopt contrastive language-image pretraining (CLIP) combined with the Transformer to improve understanding and generalization ability in semantic consistency across different data modalities. A classifier based on a graph attention network is applied to obtain predicted labels to enhance cross-modal feature representation. We construct hash codes using an optimization strategy and loss function to preserve the semantic information and compactness of the hash code. Comprehensive experiments on the NUS-WIDE, MIRFlickr25K, and MS-COCO benchmark datasets show that our EGATH significantly outperforms several state-of-the-art methods.





CodeCMR: Cross-Modal Retrieval For Function-Level Binary Source Code Matching

Neural Information Processing Systems

Binary source code matching, especially at the function level, plays a critical role in the field of computer security. Given binary code only, finding the corresponding source code improves the accuracy and efficiency of reverse engineering. Given source code only, retrieving related binary code helps confirm known vulnerabilities. However, due to the vast difference between source and binary code, few studies have investigated binary source code matching. Previously published studies focus on extracting code literals such as strings and integers, then use traditional matching algorithms such as the Hungarian algorithm for code matching.
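The literal-based baseline mentioned at the end of this abstract can be sketched as an assignment problem. The example below is an illustration under stated assumptions, not the method of any cited study: similarity is assumed to be the Jaccard overlap of literal sets, the function names and literals are invented, and the optimal one-to-one assignment is found by exhaustive search rather than by the Hungarian algorithm, which solves the same problem in polynomial time.

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap of two literal sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_assignment(binary_funcs: Dict[str, Set[str]],
                    source_funcs: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Exhaustively find the one-to-one pairing with maximum total similarity.

    Assumes len(binary_funcs) <= len(source_funcs) and tiny inputs only;
    the search is factorial in the number of functions.
    """
    b_names = list(binary_funcs)
    s_names = list(source_funcs)
    best_score, best_pairs = -1.0, []
    for perm in permutations(s_names, len(b_names)):
        score = sum(jaccard(binary_funcs[b], source_funcs[s])
                    for b, s in zip(b_names, perm))
        if score > best_score:
            best_score, best_pairs = score, list(zip(b_names, perm))
    return best_pairs


# Hypothetical literals (strings and integers) extracted from each function.
binary_funcs = {
    "sub_401000": {"%d items", "42"},
    "sub_401080": {"usage: %s", "1024"},
}
source_funcs = {
    "print_usage": {"usage: %s", "1024", "-h"},
    "count_items": {"%d items", "42"},
    "main": {"0"},
}

pairs = best_assignment(binary_funcs, source_funcs)
print(pairs)
```

For realistic function counts, a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment` would replace the exhaustive loop.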

